Development and Performance Analysis of a Fault Tolerant Algorithm for Cluster of Workstations
Abstract
A Cluster of Workstations (COW) is a network-based multicomputer system and the most prominent distributed-memory system aimed at replacing supercomputers. A COW can be viewed as a single machine in which one job is divided into n subtasks that are delegated to n workstations. The job completes only when every subtask assigned to a component workstation has completed, so all workstations must be functional. A faulty node therefore suspends the overall job until it recovers from the fault. This paper presents a fault-tolerant architecture for COW that allows a normally working workstation to perform the tasks of a faulty workstation in addition to its original assignments. Markov models are basic tools for availability modeling, and this paper presents a Markov availability model for estimating the availability of component workstations as a function of workstation failure rates.
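The abstract does not give the model itself, so the following is only a minimal sketch of what a Markov availability calculation of this kind might look like. It assumes a two-state (up/down) continuous-time Markov model per workstation, a repair rate (not mentioned in the abstract), independent workstation failures, and a takeover scheme that tolerates a fixed number of simultaneously failed workstations. All function and parameter names are hypothetical.

```python
from math import comb


def workstation_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of a two-state (up/down) Markov model.

    Assumption: the workstation fails at rate `failure_rate` and is
    repaired at rate `repair_rate`; the long-run fraction of time spent
    in the "up" state is repair_rate / (failure_rate + repair_rate).
    """
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("failure_rate must be >= 0 and repair_rate must be > 0")
    return repair_rate / (failure_rate + repair_rate)


def cluster_availability(a: float, n: int, tolerated_failures: int = 0) -> float:
    """Probability that at most `tolerated_failures` of n workstations are down.

    Assumption: workstations fail independently, each with availability `a`.
    With tolerated_failures=0 every workstation must be up (no fault
    tolerance); with tolerated_failures=1 a working workstation is assumed
    to take over the tasks of a single faulty one.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    if n < 1 or not 0 <= tolerated_failures <= n:
        raise ValueError("need n >= 1 and 0 <= tolerated_failures <= n")
    # Sum the binomial probabilities of exactly k workstations being down.
    return sum(
        comb(n, k) * (1.0 - a) ** k * a ** (n - k)
        for k in range(tolerated_failures + 1)
    )


if __name__ == "__main__":
    repair_rate = 1.0  # hypothetical value, repairs per unit time
    for failure_rate in (0.01, 0.05, 0.1):  # hypothetical failure rates
        a = workstation_availability(failure_rate, repair_rate)
        plain = cluster_availability(a, n=8, tolerated_failures=0)
        tolerant = cluster_availability(a, n=8, tolerated_failures=1)
        print(f"lambda={failure_rate}: A={a:.4f} plain={plain:.4f} tolerant={tolerant:.4f}")
```

Under these assumptions, the gap between the two cluster figures illustrates why letting a healthy workstation absorb a faulty one's tasks can raise the chance that a job completes. The paper's actual model may differ.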
Similar resources
Scheduling Large Task Graphs in Parallel Using a Fault-Tolerant Heterogeneous-Cluster-Based Search
A natural approach for scheduling tasks to a workstation cluster is to employ the multiple machines in the cluster to schedule the task graphs so that the cluster manifests itself as a “self-scheduled” platform. A few parallel approaches have been devised for scheduling task graphs using a parallel machine such as an Intel Paragon but they are not suitable for a cluster of workstations environ...
CAFT: Cost-aware and Fault-tolerant routing algorithm in 2D mesh Network-on-Chip
The increasing complexity of chips and the need to integrate more components into a single chip have made network-on-chip an important infrastructure for on-chip communication and a good alternative to traditional bus-based approaches. As chip density increases, the possibility of failure in the on-chip network increases and providing correction and fault tol...
Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing
Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, since fault tolerance is incorporated into the matrix operations, the matrix operations become resilient to any single processor failure or change with low overhead. In this paper, we present a technique called multiple chec...
Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless chec...
Voting Algorithm Based on Adaptive Neuro Fuzzy Inference System for Fault Tolerant Systems
Some applications are critical and must be designed as fault-tolerant systems. A voting algorithm is usually one of the principal elements of a fault-tolerant system. Two kinds of voting algorithm are used in most applications: the majority voting algorithm and the weighted average algorithm. These algorithms have some problems. Majority voting faces the problem of threshold limits and voter of weight...
Publication date: 2004